"Fifty Shades of Bias": Normative Ratings of Gender Bias in GPT Generated English Text
Language serves as a powerful tool for the manifestation of societal belief
systems. In doing so, it also perpetuates the prevalent biases in our society.
Gender bias is one of the most pervasive biases in our society and is seen in
online and offline discourses. As LLMs increasingly attain human-like fluency
in text generation, a nuanced understanding of the biases these systems can
produce is imperative. Prior work often treats gender bias as a
binary classification task. However, acknowledging that bias must be perceived
on a relative scale, we investigate the generation of, and the receptivity of
human annotators to, bias of varying degrees. Specifically, we create the
first dataset of GPT-generated English text with normative ratings of gender
bias. Ratings were obtained using Best-Worst Scaling, an efficient
comparative annotation framework. Next, we systematically analyze the variation
of themes of gender biases in the observed ranking and show that
identity-attack is most closely related to gender bias. Finally, we show the
performance of existing automated models trained on related concepts on our
dataset.
Comment: Camera-ready version in EMNLP 202
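The abstract's Best-Worst Scaling can be made concrete with the standard counting procedure: annotators see small tuples of items and mark the best and worst in each; an item's score is the fraction of its appearances it was chosen best minus the fraction it was chosen worst. A minimal sketch of that aggregation (the paper's exact setup may differ; the judgments below are invented):

```python
from collections import Counter

def bws_scores(judgments):
    """Compute Best-Worst Scaling scores in [-1, 1].

    judgments: iterable of (items_in_tuple, best_item, worst_item).
    score(item) = (#times best - #times worst) / #appearances.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Toy example: two annotations of the same 4-tuple of generated
# sentences; the annotator picks the most and least biased each time.
judgments = [
    (("s1", "s2", "s3", "s4"), "s1", "s4"),
    (("s1", "s2", "s3", "s4"), "s1", "s3"),
]
scores = bws_scores(judgments)  # s1 -> 1.0, s4 -> -0.5, s2 -> 0.0
```

The resulting real-valued scores are what allows bias to be ranked on a relative scale rather than classified in binary fashion.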
Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models
Leveraging shared learning through Massively Multilingual Models,
state-of-the-art machine translation models are often able to adapt to the
paucity of data for low-resource languages. However, this performance comes at
the cost of significantly bloated models which are not practically deployable.
Knowledge Distillation is one popular technique to develop competitive,
lightweight models. In this work, we first evaluate its use to compress MT
models focusing on languages with extremely limited training data. Through our
analysis across 8 languages, we find that the performance of distilled models
varies widely with priors such as the amount of synthetic data used for
distillation, the student architecture, the training hyperparameters, and the
confidence of the teacher models, making distillation a brittle compression
mechanism. To mitigate this, we explore the use of
post-training quantization for the compression of these models. Here, we find
that while distillation provides gains across some low-resource languages,
quantization provides more consistent performance trends for the entire range
of languages, especially the lowest-resource languages in our target set.
Comment: 16 pages, 7 figures, accepted to WMT 2022 (Research Track)
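Post-training quantization, unlike distillation, needs no retraining: trained float32 weights are mapped to low-bit integers after the fact. A minimal NumPy sketch of symmetric per-tensor int8 quantization, purely illustrative (the paper's actual scheme and toolchain are not specified here):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 post-training quantization."""
    scale = np.abs(w).max() / 127.0        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)   # stand-in for a weight matrix
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()   # bounded by half a quantization step
```

Because the rounding error is bounded by half a quantization step per weight, the perturbation is uniform across models, which is consistent with the more stable performance trends the abstract reports for quantization.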
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?
Large Language Models (LLMs) have demonstrated impressive performance on
Natural Language Processing (NLP) tasks, such as Question Answering,
Summarization, and Classification. The use of LLMs as evaluators, which can
rank or score the output of other models (usually LLMs), has become
increasingly popular due to the limitations of current evaluation techniques,
including the lack of appropriate benchmarks and metrics, the cost of
evaluation, and limited access to human annotators.
While LLMs are capable of handling approximately 100 languages, the majority of
languages beyond the top 20 lack systematic evaluation across various tasks,
metrics, and benchmarks. This creates an urgent need to scale up multilingual
evaluation to ensure a precise understanding of LLM performance across diverse
languages. LLM-based evaluators seem like the perfect solution to this problem,
as they do not require human annotators, human-created references, or
benchmarks and can theoretically be used to evaluate any language covered by
the LLM. In this paper, we investigate whether LLM-based evaluators can help
scale up multilingual evaluation. Specifically, we calibrate LLM-based
evaluation against 20k human judgments of five metrics across three
text-generation tasks in eight languages. Our findings indicate that LLM-based
evaluators may exhibit a bias towards higher scores and should therefore be
used with caution: they should always be calibrated against a dataset of
native-speaker judgments, particularly for low-resource and non-Latin-script
languages.
MEGA: Multilingual Evaluation of Generative AI
Generative AI models have shown impressive performance on many Natural
Language Processing tasks such as language understanding, reasoning, and
language generation. An important question being asked by the AI community
today is about the capabilities and limits of these models, and it is clear
that evaluating generative AI is very challenging. Most studies on generative
LLMs have been restricted to English and it is unclear how capable these models
are at understanding and generating text in other languages. We present the
first comprehensive benchmarking of generative LLMs, MEGA, which evaluates
models on standard NLP benchmarks, covering 16 NLP datasets across 70
typologically diverse languages. We compare the performance of generative LLMs
including ChatGPT and GPT-4 to state-of-the-art (SOTA) non-autoregressive
models on these tasks to determine how well generative models perform compared
to the previous generation of LLMs. We present a thorough analysis of the
performance of models across languages and tasks and discuss challenges in
improving the performance of generative LLMs on low-resource languages. We
create a framework for evaluating generative LLMs in the multilingual setting
and provide directions for future progress in the field.
Comment: EMNLP 202
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
We present Samanantar, the largest publicly available parallel corpora
collection for Indic languages. The collection contains a total of 49.7 million
sentence pairs between English and 11 Indic languages (from two language
families). Specifically, we compile 12.4 million sentence pairs from existing,
publicly-available parallel corpora, and additionally mine 37.4 million
sentence pairs from the web, resulting in a 4x increase. We mine the parallel
sentences from the web by combining many corpora, tools, and methods: (a)
web-crawled monolingual corpora, (b) document OCR for extracting sentences from
scanned documents, (c) multilingual representation models for aligning
sentences, and (d) approximate nearest neighbor search for searching in a large
collection of sentences. Human evaluation of samples from the newly mined
corpora validates the high quality of the parallel sentences across 11
languages. Further, we extract 83.4 million sentence pairs between all 55 Indic
language pairs from the English-centric parallel corpus using English as the
pivot language. We train multilingual NMT models spanning all these languages
on Samanantar, which outperform existing models and baselines on publicly
available benchmarks, such as FLORES, establishing the utility of Samanantar.
Our data and models are available publicly at
https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance
research in NMT and multilingual NLP for Indic languages.
Comment: Accepted to the Transactions of the Association for Computational
Linguistics (TACL)
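The mining step the abstract describes, aligning sentences via nearest-neighbour search over multilingual representations, can be sketched in miniature. A toy brute-force version with made-up, already-normalised embedding vectors (real pipelines embed millions of sentences with a multilingual model and use an approximate nearest-neighbour library for the search, which is what makes step (d) scale):

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Pair each source sentence with its nearest target sentence.

    src_emb, tgt_emb: rows are L2-normalised sentence embeddings, so the
    dot product is cosine similarity. Pairs below `threshold` are dropped.
    """
    sims = src_emb @ tgt_emb.T              # full cosine-similarity matrix
    best = sims.argmax(axis=1)              # nearest target per source row
    return [(i, int(j)) for i, j in enumerate(best) if sims[i, j] >= threshold]

# Toy normalised "embeddings": two English and two Indic-language sentences.
src = np.array([[1.0, 0.0],
                [0.0, 1.0]])
tgt = np.array([[0.0, 1.0],
                [0.96, 0.28]])
pairs = mine_pairs(src, tgt)   # [(0, 1), (1, 0)]
```

The similarity threshold plays the role of a quality filter: only pairs whose embeddings are close enough are kept as candidate parallel sentences, with human evaluation then validating samples of the mined data.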
Multi-lingual Speech to Speech Translation for Under-Resourced Languages
This report describes the research done during the first ESPERANTO/JSALT workshop, held from 13 June 2022 to 5 August 2022.